Lesson 7. Attribute and Spatial Joins

Now that we understand the logic of spatial relationship queries, let’s take a look at another fundamental spatial operation that relies on them.

This operation, called a spatial join, is the process by which we can leverage the spatial relationships between distinct datasets to merge their information into a new output dataset.

This operation can be thought as the spatial equivalent of an attribute join, in which multiple tabular datasets can be merged by aligning matching values in a common column that they both contain. Thus, we’ll start by developing an understanding of this operation first!


Instructor Notes

library(sf)
library(tmap)

7.0 Data Input and Prep

Let’s read in a table of data from the US Census’ 5-year American Community Survey (ACS5).

# Read in the ACS5 data for CA into an `sf` object.
# Note: We force the FIPS_11_digit to be read in as a string to preserve any leading zeroes.
acs5_df = read.csv("notebook_data/census/ACS5yr/census_variables_CA.csv")
head(acs5_df)
##                                            NAME c_race c_white c_black c_asian
## 1 Census Tract 4012, Alameda County, California   2456    1287     476     259
## 2 Census Tract 4013, Alameda County, California   3983     845    1348     827
## 3 Census Tract 4014, Alameda County, California   4340     713    1902     593
## 4 Census Tract 4015, Alameda County, California   2080     563    1064     215
## 5 Census Tract 4016, Alameda County, California   1889     324     960     247
## 6 Census Tract 4017, Alameda County, California   2544     553     634     302
##   c_latinx c_race_moe c_white_moe c_black_moe c_asian_moe c_latinx_moe
## 1      283        213         191         116         124          195
## 2      796        680         186         411         283          236
## 3      981        644         314         440         198          620
## 4      190        369         222         283         116          105
## 5      274        400         135         376         164          149
## 6      945        314         140         259         144          247
##   state_fips county_fips tract_fips med_rent med_hhinc c_tenants c_owners
## 1          6           1     401200     1251     62064      1245      544
## 2          6           1     401300      927     35030      1743      213
## 3          6           1     401400      884     28264      1447      274
## 4          6           1     401500      718     38684       992      449
## 5          6           1     401600     1088     32663       707      140
## 6          6           1     401700     1147     63052      1083      398
##   c_renters med_rent_moe med_hhinc_moe c_tenants_moe c_owners_moe c_renters_moe
## 1       701          128         14598            60           94            91
## 2      1530           64          6917            87           69           104
## 3      1173           64          5802           106           75           125
## 4       543           80         12077            82           96           114
## 5       567           79         16290            88           61            90
## 6       685          214         19070            95           85           116
##   c_movers c_stay c_movelocal c_movecounty c_movestate c_moveabroad
## 1     2448   1995         253          143          25           32
## 2     3978   2434        1114          252          90           88
## 3     4269   3448         699           76          27           19
## 4     2080   1750         211          112           7            0
## 5     1860   1545         148          153           4           10
## 6     2535   2001         350          116          34           34
##   c_movers_moe c_stay_moe c_movelocal_moe c_movecounty_moe c_movestate_moe
## 1          213        188             187               66              39
## 2          680        324             512              115              75
## 3          605        618             246               52              34
## 4          369        353             112              103              11
## 5          393        368              89              149               6
## 6          312        282             193               62              47
##   c_moveabroad_moe c_commute c_car c_carpool c_transit c_bike c_walk
## 1               31      1460   805        94       276    122     85
## 2               98      2046   698       223       801     37    214
## 3               26      1595   751        34       408    186    163
## 4               12       982   493        89       226     47     17
## 5               16       603   344        74       107     38      0
## 6               39      1585   648       284       373     21     29
##   c_commute_moe c_car_moe c_carpool_moe c_transit_moe c_bike_moe c_walk_moe
## 1           191       144           109           105         60         43
## 2           468       183           133           264         36        121
## 3           396       211            28           179        108        151
## 4           222       145            67            82         34         15
## 5           170       128            74           120         44         12
## 6           225       168           124           141         22         44
##   year FIPS_11_digit   p_white   p_black   p_asian   p_latinx  p_owners
## 1 2013    6001401200 0.5240228 0.1938111 0.1054560 0.11522801 0.4369478
## 2 2013    6001401300 0.2121516 0.3384384 0.2076324 0.19984936 0.1222031
## 3 2013    6001401400 0.1642857 0.4382488 0.1366359 0.22603687 0.1893573
## 4 2013    6001401500 0.2706731 0.5115385 0.1033654 0.09134615 0.4526210
## 5 2013    6001401600 0.1715193 0.5082054 0.1307570 0.14505029 0.1980198
## 6 2013    6001401700 0.2173742 0.2492138 0.1187107 0.37146226 0.3674977
##   p_renters    p_stay p_movelocal p_movecounty p_movestate p_moveabroad
## 1 0.5630522 0.8149510  0.10334967   0.05841503 0.010212418  0.013071895
## 2 0.8777969 0.6118653  0.28004022   0.06334842 0.022624434  0.022121669
## 3 0.8106427 0.8076833  0.16373858   0.01780276 0.006324666  0.004450691
## 4 0.5473790 0.8413462  0.10144231   0.05384615 0.003365385  0.000000000
## 5 0.8019802 0.8306452  0.07956989   0.08225806 0.002150538  0.005376344
## 6 0.6325023 0.7893491  0.13806706   0.04575937 0.013412229  0.013412229
##       p_car  p_carpool p_transit     p_bike     p_walk
## 1 0.5513699 0.06438356 0.1890411 0.08356164 0.05821918
## 2 0.3411535 0.10899316 0.3914956 0.01808407 0.10459433
## 3 0.4708464 0.02131661 0.2557994 0.11661442 0.10219436
## 4 0.5020367 0.09063136 0.2301426 0.04786151 0.01731161
## 5 0.5704809 0.12271973 0.1774461 0.06301824 0.00000000
## 6 0.4088328 0.17917981 0.2353312 0.01324921 0.01829653

Brief summary of the data:

Below is a table of the variables in this table. They were combined from different ACS 5 year tables.

NOTE:

  • variables that start with c_ are counts
  • variables that start with med_ are medians
  • variables that end in _moe are margin of error estimates
  • variables that start with p_ are proportions calcuated from the counts divided by the table denominator (the total count for whom that variable was assessed)
Variable Description
c_race Total population
c_white Total white non-Latinx
c_black Total black and African American non-Latinx
c_asian Total Asian non-Latinx
c_latinx Total Latinx
state_fips State level FIPS code
county_fips County level FIPS code
tract_fips Tracts level FIPS code
med_rent Median rent
med_hhinc Median household income
c_tenants Total tenants
c_owners Total owners
c_renters Total renters
c_movers Total number of people who moved
c_stay Total number of people who stayed
c_movelocal Number of people who moved locally
c_movecounty Number of people who moved counties
c_movestate Number of people who moved states
c_moveabroad Number of people who moved abroad
c_commute Total number of commuters
c_car Number of commuters who use a car
c_carpool Number of commuters who carpool
c_transit Number of commuters who use public transit
c_bike Number of commuters who bike
c_walk Number of commuters who bike
year ACS data year
FIPS_11_digit 11-digit FIPS code
…. and more

We’re going to drop all of our moe columns by identifying all of those that end with _moe. We can do that in two steps, first by using filter to identify columns that contain the string _moe.

tidyverse will help with this!

library(tidyverse) 
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.3     ✓ purrr   0.3.4
## ✓ tibble  3.1.1     ✓ dplyr   1.0.5
## ✓ tidyr   1.1.3     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.5.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
acs5_df = acs5_df %>% select(-contains("_moe"))

Unfortunately, when this dataset reads in, the 11-digit FIPS codes that should be strings actually read in as numerics, and thus the leading 0 gets truncated. We’re going to need those FIPS code in the correct format later, so let’s reformat them now.

# recast the FIPS 11-digit codes as strings, pasting a 0 at the front of each
acs5_df$FIPS_11_digit = paste0('0', acs5_df$FIPS_11_digit)

And lastly, let’s grab only the rows for year 2018 and county FIPS code 1 (i.e. Alameda County)

acs5_df_ac = acs5_df[acs5_df$year==2018 & acs5_df$county_fips==1, ]

Now, take another look at the dataframe

head(acs5_df_ac)
##                                                  NAME c_race c_white c_black
## 8324 Census Tract 4415.01, Alameda County, California   6570     677     111
## 8325    Census Tract 4047, Alameda County, California   2079    1515     134
## 8326    Census Tract 4425, Alameda County, California   7748    1430     375
## 8327    Census Tract 4503, Alameda County, California   5301    2597      96
## 8328 Census Tract 4506.07, Alameda County, California   5971    2832     324
## 8329    Census Tract 4049, Alameda County, California   3973    2383     394
##      c_asian c_latinx state_fips county_fips tract_fips med_rent med_hhinc
## 8324    4740      570          6           1     441501     1883    152394
## 8325     199      175          6           1     404700     3250    181000
## 8326    3379     1904          6           1     442500     1999     95323
## 8327    1077     1315          6           1     450300     2626    121131
## 8328    1726      804          6           1     450607     1837     96061
## 8329     442      337          6           1     404900     1446    111846
##      c_tenants c_owners c_renters c_movers c_stay c_movelocal c_movecounty
## 8324      1796     1634       162     6491   6010         257           68
## 8325       790      680       110     2043   1822          58           77
## 8326      2264     1245      1019     7628   6858         557           29
## 8327      1814     1249       565     5249   4668         134          326
## 8328      2445      883      1562     5905   4735         715          254
## 8329      1874     1095       779     3939   3647         123          100
##      c_movestate c_moveabroad c_commute c_car c_carpool c_transit c_bike c_walk
## 8324         129           27      2317  1765       264       127     28      8
## 8325          65           21      1075   572       191       170      7      6
## 8326          23          161      3330  2382       375       214     59     34
## 8327         114            7      2819  2112       183       265     28     33
## 8328         201            0      3315  2316       440       193     71    114
## 8329          34           35      2211  1030       303       522     48      0
##      year FIPS_11_digit   p_white    p_black   p_asian   p_latinx  p_owners
## 8324 2018   06001441501 0.1030441 0.01689498 0.7214612 0.08675799 0.9097996
## 8325 2018   06001404700 0.7287157 0.06445406 0.0957191 0.08417508 0.8607595
## 8326 2018   06001442500 0.1845638 0.04839959 0.4361125 0.24574084 0.5499117
## 8327 2018   06001450300 0.4899076 0.01810979 0.2031692 0.24806640 0.6885336
## 8328 2018   06001450607 0.4742924 0.05426227 0.2890638 0.13465081 0.3611452
## 8329 2018   06001404900 0.5997986 0.09916939 0.1112509 0.08482255 0.5843116
##       p_renters    p_stay p_movelocal p_movecounty p_movestate p_moveabroad
## 8324 0.09020045 0.9258974  0.03959328  0.010476044 0.019873671  0.004159606
## 8325 0.13924051 0.8918257  0.02838962  0.037689672 0.031815957  0.010279001
## 8326 0.45008834 0.8990561  0.07302045  0.003801783 0.003015207  0.021106450
## 8327 0.31146637 0.8893122  0.02552867  0.062107068 0.021718423  0.001333587
## 8328 0.63885481 0.8018628  0.12108383  0.043014395 0.034038950  0.000000000
## 8329 0.41568837 0.9258695  0.03122620  0.025387154 0.008631632  0.008885504
##          p_car  p_carpool  p_transit      p_bike      p_walk
## 8324 0.7617609 0.11394044 0.05481226 0.012084592 0.003452741
## 8325 0.5320930 0.17767442 0.15813953 0.006511628 0.005581395
## 8326 0.7153153 0.11261261 0.06426426 0.017717718 0.010210210
## 8327 0.7492018 0.06491664 0.09400497 0.009932600 0.011706279
## 8328 0.6986425 0.13273002 0.05822021 0.021417798 0.034389140
## 8329 0.4658526 0.13704206 0.23609227 0.021709634 0.000000000

Now let’s also read in our census tracts again!

tracts_sf = st_read("./notebook_data/census/Tracts/cb_2018_06_tract_500k.shp", )
## Reading layer `cb_2018_06_tract_500k' from data source `/Users/pattyf/Documents/Dlab/workshops/2021/Geospatial-Fundamentals-in-R-with-sf/notebook_data/census/Tracts/cb_2018_06_tract_500k.shp' using driver `ESRI Shapefile'
## Simple feature collection with 8041 features and 9 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -124.4096 ymin: 32.53416 xmax: -114.1312 ymax: 42.00948
## Geodetic CRS:  NAD83
head(tracts_sf)
## Simple feature collection with 6 features and 9 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -122.5009 ymin: 37.92535 xmax: -120.3345 ymax: 39.30875
## Geodetic CRS:  NAD83
##   STATEFP COUNTYFP TRACTCE             AFFGEOID       GEOID    NAME LSAD
## 1      06      009  000300 1400000US06009000300 06009000300       3   CT
## 2      06      011  000300 1400000US06011000300 06011000300       3   CT
## 3      06      013  303102 1400000US06013303102 06013303102 3031.02   CT
## 4      06      013  303202 1400000US06013303202 06013303202 3032.02   CT
## 5      06      013  303203 1400000US06013303203 06013303203 3032.03   CT
## 6      06      013  307102 1400000US06013307102 06013307102 3071.02   CT
##       ALAND AWATER                       geometry
## 1 457009794 394122 MULTIPOLYGON (((-120.764 38...
## 2 952744514 195376 MULTIPOLYGON (((-122.5001 3...
## 3   6507019      0 MULTIPOLYGON (((-121.7294 3...
## 4   3725528      0 MULTIPOLYGON (((-121.7235 3...
## 5   6354210      0 MULTIPOLYGON (((-121.7449 3...
## 6   1603153      0 MULTIPOLYGON (((-121.8226 3...
tracts_sf_ac = tracts_sf[tracts_sf$COUNTYFP == '001',]
plot(tracts_sf_ac$geometry)

7.1 Attribute Joins

Attribute Joins between sf data.frames and plain data.frames

We just mapped the census tracts. But what makes a map powerful is when you map the data associated with the locations.

  • tracts_sf_ac: These are polygon data in this sf data.frame. However, as we saw with the head command, it contains no attributes of interest!

  • acs5_df_ac: These are 2018 ACS data attributes for CA census tracts read in from a CSV file (‘census_variables_CA.csv’), into a regular data.frame. However, this data has no geometry column!

In order to map the ACS data we need to associate it with the tracts. We can do that by joining the columns from acs5_df_ac to the columns of tracts_gdf_ac using a common column as the key for matching rows. This process is called an attribute join.

Question

The image above gives us a nice conceptual summary of the types of joins we could run.

  1. In general, why might we choose one type of join over another?
  2. In our case, do we want an inner, left, right, or outer (AKA ‘full’) join?

(NOTE: You can read more about merging sf and plain data.frames here.)

Okay, here we go!

Let’s take a look at the common column in both our data.frames.

head(tracts_sf_ac['GEOID'])
## Simple feature collection with 6 features and 1 field
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -122.3159 ymin: 37.66497 xmax: -122.1062 ymax: 37.84792
## Geodetic CRS:  NAD83
##          GEOID                       geometry
## 27 06001425101 MULTIPOLYGON (((-122.3142 3...
## 28 06001428600 MULTIPOLYGON (((-122.2799 3...
## 29 06001432600 MULTIPOLYGON (((-122.1675 3...
## 30 06001433200 MULTIPOLYGON (((-122.1667 3...
## 31 06001433900 MULTIPOLYGON (((-122.1209 3...
## 32 06001436100 MULTIPOLYGON (((-122.1329 3...
head(acs5_df_ac['FIPS_11_digit'])
##      FIPS_11_digit
## 8324   06001441501
## 8325   06001404700
## 8326   06001442500
## 8327   06001450300
## 8328   06001450607
## 8329   06001404900

Note that they are not named the same thing.

    That's okay! We just need to know that they contain the same information.

Also note that they are not in the same order.

    That's not only okay... That's the point! (If they were in the same order already then we could just join them side by side, without having R find and line up the matching rows from each!)

Let’s do a left join to keep all of the census tracts in Alameda County and only the ACS data for those tracts.

NOTE: To figure out how to do this we could always take a peek at the documentation by calling ?base::merge.

?base::merge
# Left join keeps all tracts and the acs data for those tracts
tracts_acs_sf_ac = base::merge(tracts_sf_ac, acs5_df_ac, by.x = 'GEOID', by.y = "FIPS_11_digit", all.x=TRUE)

# what is the class of the  output data object
class(tracts_acs_sf_ac)
## [1] "sf"         "data.frame"

What is the class of the output if you join the tracts to the ACS data (i.e reverse the order of the inputs)?

acs_and_tracts_ac = base::merge(acs5_df_ac, tracts_sf_ac, by.y = 'GEOID', by.x = "FIPS_11_digit", all.x=TRUE)
class(acs_and_tracts_ac)
## [1] "data.frame"

ORDER MATTERS with base::merge

  • If you use base::merge to join a dataframe to an spatial dataframe the output is a spatial df.

  • If you use base::merge to join a spatial dataframe to a dataframe the output is as dataframe!

Take a look at the output

# take a look at the output
head(tracts_acs_sf_ac)
## Simple feature collection with 6 features and 52 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -122.2694 ymin: 37.83454 xmax: -122.2124 ymax: 37.88544
## Geodetic CRS:  NAD83
##         GEOID STATEFP COUNTYFP TRACTCE             AFFGEOID NAME.x LSAD   ALAND
## 1 06001400100      06      001  400100 1400000US06001400100   4001   CT 6894340
## 2 06001400200      06      001  400200 1400000US06001400200   4002   CT  586561
## 3 06001400300      06      001  400300 1400000US06001400300   4003   CT 1105851
## 4 06001400400      06      001  400400 1400000US06001400400   4004   CT  715616
## 5 06001400500      06      001  400500 1400000US06001400500   4005   CT  590307
## 6 06001400600      06      001  400600 1400000US06001400600   4006   CT  297856
##   AWATER                                        NAME.y c_race c_white c_black
## 1      0 Census Tract 4001, Alameda County, California   3115    2055     128
## 2      0 Census Tract 4002, Alameda County, California   2025    1436      59
## 3      0 Census Tract 4003, Alameda County, California   5000    3458     380
## 4      0 Census Tract 4004, Alameda County, California   3843    2551     229
## 5      0 Census Tract 4005, Alameda County, California   3786    1885     990
## 6      0 Census Tract 4006, Alameda County, California   1638     817     343
##   c_asian c_latinx state_fips county_fips tract_fips med_rent med_hhinc
## 1     592      104          6           1     400100     3501    200893
## 2     183      178          6           1     400200     1902    160536
## 3     535      364          6           1     400300     1481     94732
## 4     373      420          6           1     400400     1624    113036
## 5     264      334          6           1     400500     1627    103846
## 6     144      124          6           1     400600     1640    127232
##   c_tenants c_owners c_renters c_movers c_stay c_movelocal c_movecounty
## 1      1297     1158       139     3084   2514         202          120
## 2       855      532       323     2012   1827          40           46
## 3      2441     1057      1384     4948   4159         223          289
## 4      1750      653      1097     3754   3440         113          133
## 5      1622      605      1017     3745   3065         309          180
## 6       657      283       374     1587   1221         152          146
##   c_movestate c_moveabroad c_commute c_car c_carpool c_transit c_bike c_walk
## 1         219           29      1542   864       165       271      0     10
## 2          39           60      1211   500        40       426     62     57
## 3         156          121      2975  1252       177       835    202    171
## 4          46           22      2293   933       240       588    188     92
## 5         145           46      2325   983       156       619    177     94
## 6          68            0      1022   370       120       268     93      3
##   year   p_white    p_black    p_asian   p_latinx  p_owners p_renters    p_stay
## 1 2018 0.6597111 0.04109149 0.19004815 0.03338684 0.8928296 0.1071704 0.8151751
## 2 2018 0.7091358 0.02913580 0.09037037 0.08790123 0.6222222 0.3777778 0.9080517
## 3 2018 0.6916000 0.07600000 0.10700000 0.07280000 0.4330193 0.5669807 0.8405416
## 4 2018 0.6638043 0.05958886 0.09705959 0.10928962 0.3731429 0.6268571 0.9163559
## 5 2018 0.4978870 0.26148970 0.06973059 0.08821976 0.3729963 0.6270037 0.8184246
## 6 2018 0.4987790 0.20940171 0.08791209 0.07570208 0.4307458 0.5692542 0.7693762
##   p_movelocal p_movecounty p_movestate p_moveabroad     p_car  p_carpool
## 1  0.06549935   0.03891051  0.07101167  0.009403372 0.5603113 0.10700389
## 2  0.01988072   0.02286282  0.01938370  0.029821074 0.4128819 0.03303055
## 3  0.04506871   0.05840744  0.03152789  0.024454325 0.4208403 0.05949580
## 4  0.03010123   0.03542888  0.01225360  0.005860416 0.4068905 0.10466638
## 5  0.08251001   0.04806409  0.03871829  0.012283044 0.4227957 0.06709677
## 6  0.09577820   0.09199748  0.04284814  0.000000000 0.3620352 0.11741683
##   p_transit     p_bike      p_walk                       geometry
## 1 0.1757458 0.00000000 0.006485084 MULTIPOLYGON (((-122.2469 3...
## 2 0.3517754 0.05119736 0.047068538 MULTIPOLYGON (((-122.2574 3...
## 3 0.2806723 0.06789916 0.057478992 MULTIPOLYGON (((-122.2642 3...
## 4 0.2564326 0.08198866 0.040122111 MULTIPOLYGON (((-122.2618 3...
## 5 0.2662366 0.07612903 0.040430108 MULTIPOLYGON (((-122.2694 3...
## 6 0.2622309 0.09099804 0.002935421 MULTIPOLYGON (((-122.2681 3...

Let’s check that we have all the variables we have in our dataset now.

colnames(tracts_acs_sf_ac)
##  [1] "GEOID"        "STATEFP"      "COUNTYFP"     "TRACTCE"      "AFFGEOID"    
##  [6] "NAME.x"       "LSAD"         "ALAND"        "AWATER"       "NAME.y"      
## [11] "c_race"       "c_white"      "c_black"      "c_asian"      "c_latinx"    
## [16] "state_fips"   "county_fips"  "tract_fips"   "med_rent"     "med_hhinc"   
## [21] "c_tenants"    "c_owners"     "c_renters"    "c_movers"     "c_stay"      
## [26] "c_movelocal"  "c_movecounty" "c_movestate"  "c_moveabroad" "c_commute"   
## [31] "c_car"        "c_carpool"    "c_transit"    "c_bike"       "c_walk"      
## [36] "year"         "p_white"      "p_black"      "p_asian"      "p_latinx"    
## [41] "p_owners"     "p_renters"    "p_stay"       "p_movelocal"  "p_movecounty"
## [46] "p_movestate"  "p_moveabroad" "p_car"        "p_carpool"    "p_transit"   
## [51] "p_bike"       "p_walk"       "geometry"

Question

It’s always important to run sanity checks on our results, at each step of the way!

In this case, how many rows and columns should we have?

print("Rows and columns in the Alameda County Census tract spatial df:")
## [1] "Rows and columns in the Alameda County Census tract spatial df:"
print(dim(tracts_sf_ac))
## [1] 360  10
print("Row and columns in the ACS5 2018 data for Alameda County:")
## [1] "Row and columns in the ACS5 2018 data for Alameda County:"
print(dim(acs5_df_ac))
## [1] 361  44
print("Rows and columns in the Alameda County Census tract spatial df joined to the ACS data:")
## [1] "Rows and columns in the Alameda County Census tract spatial df joined to the ACS data:"
print(dim(tracts_acs_sf_ac))
## [1] 360  53

Let’s save out our merged data so we can use it in the final notebook.

st_write(tracts_acs_sf_ac, './outdata/tracts_acs_sdf_ac.json', driver='GeoJSON', delete_dsn=T)
## Deleting source `./outdata/tracts_acs_sdf_ac.json' using driver `GeoJSON'
## Writing layer `tracts_acs_sdf_ac' to data source `./outdata/tracts_acs_sdf_ac.json' using driver `GeoJSON'
## Writing 360 features with 52 fields and geometry type Multi Polygon.

Exercise: Choropleth Map

We can now make choropleth maps using our attribute joined geodataframe. Go ahead and pick one variable to color the map, then map it using tmap (since it’s too easy using the plot method). You can go back to lesson 5 if you need a refresher on how to make this!

To see the solution, look at the hidden text below.

head(tracts_acs_sf_ac)
## Simple feature collection with 6 features and 52 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -122.2694 ymin: 37.83454 xmax: -122.2124 ymax: 37.88544
## Geodetic CRS:  NAD83
##         GEOID STATEFP COUNTYFP TRACTCE             AFFGEOID NAME.x LSAD   ALAND
## 1 06001400100      06      001  400100 1400000US06001400100   4001   CT 6894340
## 2 06001400200      06      001  400200 1400000US06001400200   4002   CT  586561
## 3 06001400300      06      001  400300 1400000US06001400300   4003   CT 1105851
## 4 06001400400      06      001  400400 1400000US06001400400   4004   CT  715616
## 5 06001400500      06      001  400500 1400000US06001400500   4005   CT  590307
## 6 06001400600      06      001  400600 1400000US06001400600   4006   CT  297856
##   AWATER                                        NAME.y c_race c_white c_black
## 1      0 Census Tract 4001, Alameda County, California   3115    2055     128
## 2      0 Census Tract 4002, Alameda County, California   2025    1436      59
## 3      0 Census Tract 4003, Alameda County, California   5000    3458     380
## 4      0 Census Tract 4004, Alameda County, California   3843    2551     229
## 5      0 Census Tract 4005, Alameda County, California   3786    1885     990
## 6      0 Census Tract 4006, Alameda County, California   1638     817     343
##   c_asian c_latinx state_fips county_fips tract_fips med_rent med_hhinc
## 1     592      104          6           1     400100     3501    200893
## 2     183      178          6           1     400200     1902    160536
## 3     535      364          6           1     400300     1481     94732
## 4     373      420          6           1     400400     1624    113036
## 5     264      334          6           1     400500     1627    103846
## 6     144      124          6           1     400600     1640    127232
##   c_tenants c_owners c_renters c_movers c_stay c_movelocal c_movecounty
## 1      1297     1158       139     3084   2514         202          120
## 2       855      532       323     2012   1827          40           46
## 3      2441     1057      1384     4948   4159         223          289
## 4      1750      653      1097     3754   3440         113          133
## 5      1622      605      1017     3745   3065         309          180
## 6       657      283       374     1587   1221         152          146
##   c_movestate c_moveabroad c_commute c_car c_carpool c_transit c_bike c_walk
## 1         219           29      1542   864       165       271      0     10
## 2          39           60      1211   500        40       426     62     57
## 3         156          121      2975  1252       177       835    202    171
## 4          46           22      2293   933       240       588    188     92
## 5         145           46      2325   983       156       619    177     94
## 6          68            0      1022   370       120       268     93      3
##   year   p_white    p_black    p_asian   p_latinx  p_owners p_renters    p_stay
## 1 2018 0.6597111 0.04109149 0.19004815 0.03338684 0.8928296 0.1071704 0.8151751
## 2 2018 0.7091358 0.02913580 0.09037037 0.08790123 0.6222222 0.3777778 0.9080517
## 3 2018 0.6916000 0.07600000 0.10700000 0.07280000 0.4330193 0.5669807 0.8405416
## 4 2018 0.6638043 0.05958886 0.09705959 0.10928962 0.3731429 0.6268571 0.9163559
## 5 2018 0.4978870 0.26148970 0.06973059 0.08821976 0.3729963 0.6270037 0.8184246
## 6 2018 0.4987790 0.20940171 0.08791209 0.07570208 0.4307458 0.5692542 0.7693762
##   p_movelocal p_movecounty p_movestate p_moveabroad     p_car  p_carpool
## 1  0.06549935   0.03891051  0.07101167  0.009403372 0.5603113 0.10700389
## 2  0.01988072   0.02286282  0.01938370  0.029821074 0.4128819 0.03303055
## 3  0.04506871   0.05840744  0.03152789  0.024454325 0.4208403 0.05949580
## 4  0.03010123   0.03542888  0.01225360  0.005860416 0.4068905 0.10466638
## 5  0.08251001   0.04806409  0.03871829  0.012283044 0.4227957 0.06709677
## 6  0.09577820   0.09199748  0.04284814  0.000000000 0.3620352 0.11741683
##   p_transit     p_bike      p_walk                       geometry
## 1 0.1757458 0.00000000 0.006485084 MULTIPOLYGON (((-122.2469 3...
## 2 0.3517754 0.05119736 0.047068538 MULTIPOLYGON (((-122.2574 3...
## 3 0.2806723 0.06789916 0.057478992 MULTIPOLYGON (((-122.2642 3...
## 4 0.2564326 0.08198866 0.040122111 MULTIPOLYGON (((-122.2618 3...
## 5 0.2662366 0.07612903 0.040430108 MULTIPOLYGON (((-122.2694 3...
## 6 0.2622309 0.09099804 0.002935421 MULTIPOLYGON (((-122.2681 3...
# YOUR CODE HERE

Solution hidden here!

7.2 Spatial Joins

Great! We’ve wrapped our heads around the concept of an attribute join.

Now let’s extend that concept to its spatially explicit equivalent: the spatial join!


To start, we’ll read in some other data: The Alameda County schools data.

Then we’ll work with that data and our tracts_acs_sf_ac data together.

# read in school data from a csv file
schools_df = read.csv('notebook_data/alco_schools.csv')

# promote to a spatial df
schools_sf = st_as_sf(schools_df, coords = c('X', 'Y'), crs=4326)

Let’s check if we have to transform the schools to match thetracts_acs_sf_ac’s CRS.

print('schools_sf CRS:')
## [1] "schools_sf CRS:"
print(st_crs(schools_sf))
## Coordinate Reference System:
##   User input: EPSG:4326 
##   wkt:
## GEOGCRS["WGS 84",
##     DATUM["World Geodetic System 1984",
##         ELLIPSOID["WGS 84",6378137,298.257223563,
##             LENGTHUNIT["metre",1]]],
##     PRIMEM["Greenwich",0,
##         ANGLEUNIT["degree",0.0174532925199433]],
##     CS[ellipsoidal,2],
##         AXIS["geodetic latitude (Lat)",north,
##             ORDER[1],
##             ANGLEUNIT["degree",0.0174532925199433]],
##         AXIS["geodetic longitude (Lon)",east,
##             ORDER[2],
##             ANGLEUNIT["degree",0.0174532925199433]],
##     USAGE[
##         SCOPE["unknown"],
##         AREA["World"],
##         BBOX[-90,-180,90,180]],
##     ID["EPSG",4326]]
print('tracts_acs_sf_ac CRS:')
## [1] "tracts_acs_sf_ac CRS:"
print(st_crs(tracts_acs_sf_ac))
## Coordinate Reference System:
##   User input: NAD83 
##   wkt:
## GEOGCRS["NAD83",
##     DATUM["North American Datum 1983",
##         ELLIPSOID["GRS 1980",6378137,298.257222101,
##             LENGTHUNIT["metre",1]]],
##     PRIMEM["Greenwich",0,
##         ANGLEUNIT["degree",0.0174532925199433]],
##     CS[ellipsoidal,2],
##         AXIS["latitude",north,
##             ORDER[1],
##             ANGLEUNIT["degree",0.0174532925199433]],
##         AXIS["longitude",east,
##             ORDER[2],
##             ANGLEUNIT["degree",0.0174532925199433]],
##     ID["EPSG",4269]]

Yes we do! Let’s do that.

NOTE: Explicit syntax aiming at that dataset’s CRS leaves less room for human error!

schools_sf = st_transform(schools_sf, st_crs(tracts_acs_sf_ac))

print('schools_sf CRS:')
## [1] "schools_sf CRS:"
print(st_crs(schools_sf))
## Coordinate Reference System:
##   User input: NAD83 
##   wkt:
## GEOGCRS["NAD83",
##     DATUM["North American Datum 1983",
##         ELLIPSOID["GRS 1980",6378137,298.257222101,
##             LENGTHUNIT["metre",1]]],
##     PRIMEM["Greenwich",0,
##         ANGLEUNIT["degree",0.0174532925199433]],
##     CS[ellipsoidal,2],
##         AXIS["latitude",north,
##             ORDER[1],
##             ANGLEUNIT["degree",0.0174532925199433]],
##         AXIS["longitude",east,
##             ORDER[2],
##             ANGLEUNIT["degree",0.0174532925199433]],
##     ID["EPSG",4269]]
print('tracts_acs_sf_ac CRS:')
## [1] "tracts_acs_sf_ac CRS:"
print(st_crs(tracts_acs_sf_ac))
## Coordinate Reference System:
##   User input: NAD83 
##   wkt:
## GEOGCRS["NAD83",
##     DATUM["North American Datum 1983",
##         ELLIPSOID["GRS 1980",6378137,298.257222101,
##             LENGTHUNIT["metre",1]]],
##     PRIMEM["Greenwich",0,
##         ANGLEUNIT["degree",0.0174532925199433]],
##     CS[ellipsoidal,2],
##         AXIS["latitude",north,
##             ORDER[1],
##             ANGLEUNIT["degree",0.0174532925199433]],
##         AXIS["longitude",east,
##             ORDER[2],
##             ANGLEUNIT["degree",0.0174532925199433]],
##     ID["EPSG",4269]]

Now we’re ready to combine the datasets in an analysis.

In this case, we want to get data from the census tract within which each school is located.

But how can we do that? The two datasets don’t share a common column to use for a join.

colnames(tracts_acs_sf_ac)
##  [1] "GEOID"        "STATEFP"      "COUNTYFP"     "TRACTCE"      "AFFGEOID"    
##  [6] "NAME.x"       "LSAD"         "ALAND"        "AWATER"       "NAME.y"      
## [11] "c_race"       "c_white"      "c_black"      "c_asian"      "c_latinx"    
## [16] "state_fips"   "county_fips"  "tract_fips"   "med_rent"     "med_hhinc"   
## [21] "c_tenants"    "c_owners"     "c_renters"    "c_movers"     "c_stay"      
## [26] "c_movelocal"  "c_movecounty" "c_movestate"  "c_moveabroad" "c_commute"   
## [31] "c_car"        "c_carpool"    "c_transit"    "c_bike"       "c_walk"      
## [36] "year"         "p_white"      "p_black"      "p_asian"      "p_latinx"    
## [41] "p_owners"     "p_renters"    "p_stay"       "p_movelocal"  "p_movecounty"
## [46] "p_movestate"  "p_moveabroad" "p_car"        "p_carpool"    "p_transit"   
## [51] "p_bike"       "p_walk"       "geometry"
colnames(schools_sf)
## [1] "Site"     "Address"  "City"     "State"    "Type"     "API"      "Org"     
## [8] "geometry"

However, they do have a shared relationship by way of space!

So, we’ll use a spatial relationship query to figure out the census tract that each school is in, then associate the tract’s data with that school (as additional data in the school’s row). This is a spatial join!

Census Tract Data Associated with Each School

In this case, let’s say we’re interested in the relationship between the median household income in a census tract (tracts_acs_sf_ac$med_hhinc) and a school’s Academic Performance Index (schools_gdf$API).

To start, let’s take a look at the distributions of our two variables of interest.

head(tracts_acs_sf_ac)
## Simple feature collection with 6 features and 52 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -122.2694 ymin: 37.83454 xmax: -122.2124 ymax: 37.88544
## Geodetic CRS:  NAD83
##         GEOID STATEFP COUNTYFP TRACTCE             AFFGEOID NAME.x LSAD   ALAND
## 1 06001400100      06      001  400100 1400000US06001400100   4001   CT 6894340
## 2 06001400200      06      001  400200 1400000US06001400200   4002   CT  586561
## 3 06001400300      06      001  400300 1400000US06001400300   4003   CT 1105851
## 4 06001400400      06      001  400400 1400000US06001400400   4004   CT  715616
## 5 06001400500      06      001  400500 1400000US06001400500   4005   CT  590307
## 6 06001400600      06      001  400600 1400000US06001400600   4006   CT  297856
##   AWATER                                        NAME.y c_race c_white c_black
## 1      0 Census Tract 4001, Alameda County, California   3115    2055     128
## 2      0 Census Tract 4002, Alameda County, California   2025    1436      59
## 3      0 Census Tract 4003, Alameda County, California   5000    3458     380
## 4      0 Census Tract 4004, Alameda County, California   3843    2551     229
## 5      0 Census Tract 4005, Alameda County, California   3786    1885     990
## 6      0 Census Tract 4006, Alameda County, California   1638     817     343
##   c_asian c_latinx state_fips county_fips tract_fips med_rent med_hhinc
## 1     592      104          6           1     400100     3501    200893
## 2     183      178          6           1     400200     1902    160536
## 3     535      364          6           1     400300     1481     94732
## 4     373      420          6           1     400400     1624    113036
## 5     264      334          6           1     400500     1627    103846
## 6     144      124          6           1     400600     1640    127232
##   c_tenants c_owners c_renters c_movers c_stay c_movelocal c_movecounty
## 1      1297     1158       139     3084   2514         202          120
## 2       855      532       323     2012   1827          40           46
## 3      2441     1057      1384     4948   4159         223          289
## 4      1750      653      1097     3754   3440         113          133
## 5      1622      605      1017     3745   3065         309          180
## 6       657      283       374     1587   1221         152          146
##   c_movestate c_moveabroad c_commute c_car c_carpool c_transit c_bike c_walk
## 1         219           29      1542   864       165       271      0     10
## 2          39           60      1211   500        40       426     62     57
## 3         156          121      2975  1252       177       835    202    171
## 4          46           22      2293   933       240       588    188     92
## 5         145           46      2325   983       156       619    177     94
## 6          68            0      1022   370       120       268     93      3
##   year   p_white    p_black    p_asian   p_latinx  p_owners p_renters    p_stay
## 1 2018 0.6597111 0.04109149 0.19004815 0.03338684 0.8928296 0.1071704 0.8151751
## 2 2018 0.7091358 0.02913580 0.09037037 0.08790123 0.6222222 0.3777778 0.9080517
## 3 2018 0.6916000 0.07600000 0.10700000 0.07280000 0.4330193 0.5669807 0.8405416
## 4 2018 0.6638043 0.05958886 0.09705959 0.10928962 0.3731429 0.6268571 0.9163559
## 5 2018 0.4978870 0.26148970 0.06973059 0.08821976 0.3729963 0.6270037 0.8184246
## 6 2018 0.4987790 0.20940171 0.08791209 0.07570208 0.4307458 0.5692542 0.7693762
##   p_movelocal p_movecounty p_movestate p_moveabroad     p_car  p_carpool
## 1  0.06549935   0.03891051  0.07101167  0.009403372 0.5603113 0.10700389
## 2  0.01988072   0.02286282  0.01938370  0.029821074 0.4128819 0.03303055
## 3  0.04506871   0.05840744  0.03152789  0.024454325 0.4208403 0.05949580
## 4  0.03010123   0.03542888  0.01225360  0.005860416 0.4068905 0.10466638
## 5  0.08251001   0.04806409  0.03871829  0.012283044 0.4227957 0.06709677
## 6  0.09577820   0.09199748  0.04284814  0.000000000 0.3620352 0.11741683
##   p_transit     p_bike      p_walk                       geometry
## 1 0.1757458 0.00000000 0.006485084 MULTIPOLYGON (((-122.2469 3...
## 2 0.3517754 0.05119736 0.047068538 MULTIPOLYGON (((-122.2574 3...
## 3 0.2806723 0.06789916 0.057478992 MULTIPOLYGON (((-122.2642 3...
## 4 0.2564326 0.08198866 0.040122111 MULTIPOLYGON (((-122.2618 3...
## 5 0.2662366 0.07612903 0.040430108 MULTIPOLYGON (((-122.2694 3...
## 6 0.2622309 0.09099804 0.002935421 MULTIPOLYGON (((-122.2681 3...
hist(tracts_acs_sf_ac$med_hhinc)

hist(schools_sf$API)

Oh, right! Those pesky schools with no reported APIs (i.e. API == 0)! Let’s drop those.

schools_sf_api = schools_sf[schools_sf$API > 0, ]
hist(schools_sf_api$API)

Much better!

Now, maybe we think there ought to be some correlation between the two variables? As a first pass at this possibility, let’s overlay the two datasets, coloring each one by its variable of interest. This should give us a sense of whether or not similar values co-occur.

tmap_mode('view')
## tmap mode set to interactive viewing
tm_shape(tracts_acs_sf_ac) + 
  tm_polygons(col = 'med_hhinc',
              border.col="white",
             palette = 'RdYlGn',
             style="jenks") + 
tm_shape(schools_sf_api) + 
  tm_dots(col = 'API',
          palette = 'RdYlGn',
          border.col="black",
          style="jenks",
          size = 0.15)

Spatially Joining our Schools and Census Tracts

Though it’s hard to say for sure, it certainly looks possible. It would be ideal to scatterplot the variables! But in order to do that, we need to know the median household income in each school’s tract, which means we definitely need our spatial join!

We’ll first take a look at the documentation for the spatial join function, st_join.

?st_join

Looks like the key arguments to consider are:

  • the two sf data.frames to be spatially joined (x and y)

  • the type of join to execute (left=), which is a left join if TRUE (default), or an inner join if FALSE

  • the spatial relationship query to use in the join (join=), which by default is st_intersects

NOTES:

  • By default st_join is a left join

  • By default st_join maintains the geometries of the first data.frame input to the operation (i.e. the geometries of x).

When spatially joining two sf dataframes with st_join, the spatial dataframe whose geometry you want to keep should be listed first!

Question

  1. Which sf data.frame are we joining onto which (i.e. which one is getting the other one’s data added to it)?
  2. What happened to ‘outer’ as a join type?
  3. Thus, in our operation, which sf data.frame should be x, which should be y, and should left be TRUE or FALSE?

Alright! Let’s run our join!

schools_jointracts = st_join(schools_sf_api, tracts_acs_sf_ac)
## although coordinates are longitude/latitude, st_intersects assumes that they are planar
## although coordinates are longitude/latitude, st_intersects assumes that they are planar
# We don't need to specify default arguments!
#schools_jointracts = st_join(schools_sf_api, tracts_acs_sf_ac, left=T, join=st_within)

Checking Our Output


Question

As always, we want to sanity-check our intermediate result before we rush ahead.

One way to do that is to introspect the structure of the result object a bit.

  1. What type of object should that have given us?
  2. What should the dimensions of that object be, and why?
  3. If we wanted a visual check of our results (i.e. a plot or map), what could we do?
print(dim(schools_jointracts))   # the join output
## [1] 325  60
print(dim(schools_sf_api))       # the input schools
## [1] 325   8
print(dim(tracts_acs_sf_ac))     # the input tracts
## [1] 360  53
head(schools_jointracts)
## Simple feature collection with 6 features and 59 fields
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: -122.2616 ymin: 37.739 xmax: -122.2348 ymax: 37.76911
## Geodetic CRS:  NAD83
##                        Site               Address    City State Type API    Org
## 1 Amelia Earhart Elementary 400 Packet Landing Rd Alameda    CA   ES 933 Public
## 2       Bay Farm Elementary   200 Aughinbaugh Way Alameda    CA   ES 932 Public
## 3  Donald D. Lum Elementary    1801 Sandcreek Way Alameda    CA   ES 853 Public
## 4         Edison Elementary  2700 Buena Vista Ave Alameda    CA   ES 927 Public
## 5     Frank Otis Elementary      3010 Fillmore St Alameda    CA   ES 894 Public
## 6       Franklin Elementary  1433 San Antonio Ave Alameda    CA   ES 893 Public
##         GEOID STATEFP COUNTYFP TRACTCE             AFFGEOID  NAME.x LSAD
## 1 06001428302      06      001  428302 1400000US06001428302 4283.02   CT
## 2 06001428302      06      001  428302 1400000US06001428302 4283.02   CT
## 3 06001428500      06      001  428500 1400000US06001428500    4285   CT
## 4 06001427100      06      001  427100 1400000US06001427100    4271   CT
## 5 06001428200      06      001  428200 1400000US06001428200    4282   CT
## 6 06001427900      06      001  427900 1400000US06001427900    4279   CT
##     ALAND AWATER                                           NAME.y c_race
## 1 2234425 474770 Census Tract 4283.02, Alameda County, California   7501
## 2 2234425 474770 Census Tract 4283.02, Alameda County, California   7501
## 3  624197 383522    Census Tract 4285, Alameda County, California   3152
## 4 1025446 168187    Census Tract 4271, Alameda County, California   3768
## 5 1362067 804736    Census Tract 4282, Alameda County, California   6643
## 6  825214      0    Census Tract 4279, Alameda County, California   5031
##   c_white c_black c_asian c_latinx state_fips county_fips tract_fips med_rent
## 1    3114     273    3070      562          6           1     428302     1946
## 2    3114     273    3070      562          6           1     428302     1946
## 3    1293     268     940      378          6           1     428500     1873
## 4    2446      67     574      457          6           1     427100     1767
## 5    3468     387    1589      686          6           1     428200     1855
## 6    2554     228    1407      563          6           1     427900     1712
##   med_hhinc c_tenants c_owners c_renters c_movers c_stay c_movelocal
## 1    147667      2540     2015       525     7436   6705         395
## 2    147667      2540     2015       525     7436   6705         395
## 3     88333      1422      485       937     3125   2641         282
## 4    135833      1420     1040       380     3724   3498         121
## 5    103781      2641     1468      1173     6587   6155         205
## 6    102077      2030      734      1296     4971   4470         108
##   c_movecounty c_movestate c_moveabroad c_commute c_car c_carpool c_transit
## 1           99         175           62      3812  2595       296       409
## 2           99         175           62      3812  2595       296       409
## 3          102         100            0      1514   910        65       374
## 4           86           0           19      1755   986       136       303
## 5           71         141           15      3391  2189       229       510
## 6          244         130           19      3102  1705       380       624
##   c_bike c_walk year   p_white    p_black   p_asian   p_latinx  p_owners
## 1     18     73 2018 0.4151446 0.03639515 0.4092788 0.07492334 0.7933071
## 2     18     73 2018 0.4151446 0.03639515 0.4092788 0.07492334 0.7933071
## 3     50     18 2018 0.4102157 0.08502538 0.2982234 0.11992386 0.3410689
## 4     33     64 2018 0.6491507 0.01778132 0.1523355 0.12128450 0.7323944
## 5     51    108 2018 0.5220533 0.05825681 0.2391992 0.10326660 0.5558501
## 6     41     52 2018 0.5076526 0.04531902 0.2796661 0.11190618 0.3615764
##   p_renters    p_stay p_movelocal p_movecounty p_movestate p_moveabroad
## 1 0.2066929 0.9016945  0.05311996   0.01331361  0.02353416  0.008337816
## 2 0.2066929 0.9016945  0.05311996   0.01331361  0.02353416  0.008337816
## 3 0.6589311 0.8451200  0.09024000   0.03264000  0.03200000  0.000000000
## 4 0.2676056 0.9393126  0.03249194   0.02309345  0.00000000  0.005102041
## 5 0.4441499 0.9344163  0.03112191   0.01077881  0.02140580  0.002277213
## 6 0.6384236 0.8992154  0.02172601   0.04908469  0.02615168  0.003822169
##       p_car  p_carpool p_transit      p_bike     p_walk
## 1 0.6807450 0.07764953 0.1072928 0.004721931 0.01915005
## 2 0.6807450 0.07764953 0.1072928 0.004721931 0.01915005
## 3 0.6010568 0.04293263 0.2470277 0.033025099 0.01188904
## 4 0.5618234 0.07749288 0.1726496 0.018803419 0.03646724
## 5 0.6455323 0.06753170 0.1503981 0.015039811 0.03184901
## 6 0.5496454 0.12250161 0.2011605 0.013217279 0.01676338
##                     geometry
## 1 POINT (-122.2388 37.74476)
## 2   POINT (-122.2519 37.739)
## 3 POINT (-122.2589 37.76206)
## 4 POINT (-122.2348 37.76525)
## 5 POINT (-122.2381 37.75396)
## 6 POINT (-122.2616 37.76911)

Confirmed! The output of the our st_join operation is an sf data.frame (schools_jointracts) with:

  • a row for each school that is located inside a census tract (all of them are)
  • the point geometry of that school
  • all of the attribute data columns (non-geometry columns) from both input sf data.frames

Let’s also take a look at an overlay map of the schools on the tracts. If we color the schools categorically by their tracts IDs, then we should see that all schools within a given tract polygon are the same color.

tm_shape(tracts_acs_sf_ac) + 
  tm_polygons(col='white', border.col='black') + 
tm_shape(schools_jointracts) + 
  tm_dots(col='GEOID', size=0.2)
## Warning: Number of levels of the variable "GEOID" is 208, which is
## larger than max.categories (which is 30), so levels are combined. Set
## tmap_options(max.categories = 208) in the layer function to show all levels.

Assessing the Relationship between Median Household Income and API

Fantastic! That looks right!

Now we can create that scatterplot we were thinking about!

plot(schools_jointracts$med_hhinc, schools_jointracts$API,
     xlab = 'median household income ($)',
     ylab = 'API')

Wow! Just as we suspected based on our overlay map, there’s a pretty obvious, strong, and positive correlation between median household income in a school’s tract and the school’s API.

Spatial Join Revisited

Keep it simple

Instead of joining everything in the ACS data to the schools you can just pick one var!

#simple spatial join
schools_jointracts2 = st_join(schools_sf_api, tracts_acs_sf_ac['med_hhinc'])
## although coordinates are longitude/latitude, st_intersects assumes that they are planar
## although coordinates are longitude/latitude, st_intersects assumes that they are planar
head(schools_jointracts2)
## Simple feature collection with 6 features and 8 fields
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: -122.2616 ymin: 37.739 xmax: -122.2348 ymax: 37.76911
## Geodetic CRS:  NAD83
##                        Site               Address    City State Type API    Org
## 1 Amelia Earhart Elementary 400 Packet Landing Rd Alameda    CA   ES 933 Public
## 2       Bay Farm Elementary   200 Aughinbaugh Way Alameda    CA   ES 932 Public
## 3  Donald D. Lum Elementary    1801 Sandcreek Way Alameda    CA   ES 853 Public
## 4         Edison Elementary  2700 Buena Vista Ave Alameda    CA   ES 927 Public
## 5     Frank Otis Elementary      3010 Fillmore St Alameda    CA   ES 894 Public
## 6       Franklin Elementary  1433 San Antonio Ave Alameda    CA   ES 893 Public
##   med_hhinc                   geometry
## 1    147667 POINT (-122.2388 37.74476)
## 2    147667   POINT (-122.2519 37.739)
## 3     88333 POINT (-122.2589 37.76206)
## 4    135833 POINT (-122.2348 37.76525)
## 5    103781 POINT (-122.2381 37.75396)
## 6    102077 POINT (-122.2616 37.76911)

7.3: Aggregation

We just saw that a spatial join in one way to leverage the spatial relationship between two datasets in order to create a new, composite dataset.

An aggregation is another way we can generate new data from this relationship. In this case, for each feature in one dataset we find all the features in another dataset that satisfy our chosen spatial relationship query with it (e.g. within, intersects), then aggregate one or more of the joined output variables using some summary function (e.g. count, mean).

Calculating the Mean API within each Census Tract

In our spatial join exercise the output suggested that API scores are related to median household income. However, many census tracts have more than one school. So, a mean API score per census tract would be a better way to explore this relationship. Let’s use a spatial data aggregation to calculate that value.

tracts_meanAPI = sf:::aggregate.sf(x=schools_sf_api['API'], by=tracts_acs_sf_ac, FUN=mean)
## although coordinates are longitude/latitude, st_intersects assumes that they are planar
head(tracts_meanAPI)
## Simple feature collection with 6 features and 1 field
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -122.2694 ymin: 37.83454 xmax: -122.2124 ymax: 37.88544
## Geodetic CRS:  NAD83
##   API                       geometry
## 1 864 MULTIPOLYGON (((-122.2469 3...
## 2 703 MULTIPOLYGON (((-122.2574 3...
## 3  NA MULTIPOLYGON (((-122.2642 3...
## 4 892 MULTIPOLYGON (((-122.2618 3...
## 5  NA MULTIPOLYGON (((-122.2694 3...
## 6  NA MULTIPOLYGON (((-122.2681 3...

Let’s plot the results

# plot the tracts, coloring them by *mean* API
tm_shape(tracts_acs_sf_ac) + 
  tm_polygons(col = 'med_hhinc',
              border.col='white',
             palette = 'RdYlGn',
             style = 'jenks',
             title = 'Median HHInc by Tract') + 
  
# Now plot the tracts as points, colored by med HHI
tm_shape(tracts_meanAPI) + 
  tm_dots(col='API',
          border.col='black',
          palette='RdYlGn',
          style='jenks',
          size=0.1,
          title='Mean API by Tract')

Now let’s join our spatially aggregated API Mean back to the census tract sf dataframe tracts_acs_sf_ac so that we can create a scatter plot of the data.

#simple spatial join of one var
# first rename the col before the join bc alread have an API col
colnames(tracts_meanAPI) <- c('mean_API','geometry')
tracts_acs_sf_ac2 = st_join(tracts_acs_sf_ac, tracts_meanAPI['mean_API'])
## although coordinates are longitude/latitude, st_intersects assumes that they are planar
plot(tracts_acs_sf_ac2$med_hhinc, tracts_acs_sf_ac2$mean_API,
     xlab = 'median household income ($)',
     ylab = 'Mean API')

Exercise: Spatial Joins and Aggregations

Getting the Aggregated School Counts

In this exercise, you will practice spatial aggregation and spatial joins.

  1. Aggregate the number of schools (schools_sf) within each census tract with the sum function.
  2. Join the count back to the Alameda County census tract data frame tracts_acs_sf_ac2

Note: to make this easier, we will add a dummy variable “count” to the schools data and set it to 1.

schools_sf$school_count = 1  # Add a new variable
head(schools_sf)# Take a look
## Simple feature collection with 6 features and 8 fields
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: -122.2616 ymin: 37.739 xmax: -122.2348 ymax: 37.76911
## Geodetic CRS:  NAD83
##                        Site               Address    City State Type API    Org
## 1 Amelia Earhart Elementary 400 Packet Landing Rd Alameda    CA   ES 933 Public
## 2       Bay Farm Elementary   200 Aughinbaugh Way Alameda    CA   ES 932 Public
## 3  Donald D. Lum Elementary    1801 Sandcreek Way Alameda    CA   ES 853 Public
## 4         Edison Elementary  2700 Buena Vista Ave Alameda    CA   ES 927 Public
## 5     Frank Otis Elementary      3010 Fillmore St Alameda    CA   ES 894 Public
## 6       Franklin Elementary  1433 San Antonio Ave Alameda    CA   ES 893 Public
##                     geometry school_count
## 1 POINT (-122.2388 37.74476)            1
## 2   POINT (-122.2519 37.739)            1
## 3 POINT (-122.2589 37.76206)            1
## 4 POINT (-122.2348 37.76525)            1
## 5 POINT (-122.2381 37.75396)            1
## 6 POINT (-122.2616 37.76911)            1

Your code here!

# YOUR CODE HERE

# Aggregate the number of schools (`schools_sf`) within each census tract with the `sum` function
# And save to a new sf dataframe called `school_counts_by_tract`
school_counts_by_tract = sf:::aggregate.sf(x=schools_sf['school_count'], by=tracts_acs_sf_ac, FUN=sum)
## although coordinates are longitude/latitude, st_intersects assumes that they are planar
# Take a look
head(school_counts_by_tract)
## Simple feature collection with 6 features and 1 field
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -122.2694 ymin: 37.83454 xmax: -122.2124 ymax: 37.88544
## Geodetic CRS:  NAD83
##   school_count                       geometry
## 1            1 MULTIPOLYGON (((-122.2469 3...
## 2            1 MULTIPOLYGON (((-122.2574 3...
## 3           NA MULTIPOLYGON (((-122.2642 3...
## 4            2 MULTIPOLYGON (((-122.2618 3...
## 5            1 MULTIPOLYGON (((-122.2694 3...
## 6           NA MULTIPOLYGON (((-122.2681 3...
# Join the count back to the Alameda County census tract data frame `tracts_acs_sf_ac2`
tracts_acs_sf_ac2 = st_join(tracts_acs_sf_ac2, school_counts_by_tract['school_count'])
## although coordinates are longitude/latitude, st_intersects assumes that they are planar
# Take a look at the output
head(tracts_acs_sf_ac2)
## Simple feature collection with 6 features and 54 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -122.2469 ymin: 37.84996 xmax: -122.2124 ymax: 37.88544
## Geodetic CRS:  NAD83
##            GEOID STATEFP COUNTYFP TRACTCE             AFFGEOID NAME.x LSAD
## 1    06001400100      06      001  400100 1400000US06001400100   4001   CT
## 1.8  06001400100      06      001  400100 1400000US06001400100   4001   CT
## 1.9  06001400100      06      001  400100 1400000US06001400100   4001   CT
## 1.10 06001400100      06      001  400100 1400000US06001400100   4001   CT
## 1.11 06001400100      06      001  400100 1400000US06001400100   4001   CT
## 1.12 06001400100      06      001  400100 1400000US06001400100   4001   CT
##        ALAND AWATER                                        NAME.y c_race
## 1    6894340      0 Census Tract 4001, Alameda County, California   3115
## 1.8  6894340      0 Census Tract 4001, Alameda County, California   3115
## 1.9  6894340      0 Census Tract 4001, Alameda County, California   3115
## 1.10 6894340      0 Census Tract 4001, Alameda County, California   3115
## 1.11 6894340      0 Census Tract 4001, Alameda County, California   3115
## 1.12 6894340      0 Census Tract 4001, Alameda County, California   3115
##      c_white c_black c_asian c_latinx state_fips county_fips tract_fips
## 1       2055     128     592      104          6           1     400100
## 1.8     2055     128     592      104          6           1     400100
## 1.9     2055     128     592      104          6           1     400100
## 1.10    2055     128     592      104          6           1     400100
## 1.11    2055     128     592      104          6           1     400100
## 1.12    2055     128     592      104          6           1     400100
##      med_rent med_hhinc c_tenants c_owners c_renters c_movers c_stay
## 1        3501    200893      1297     1158       139     3084   2514
## 1.8      3501    200893      1297     1158       139     3084   2514
## 1.9      3501    200893      1297     1158       139     3084   2514
## 1.10     3501    200893      1297     1158       139     3084   2514
## 1.11     3501    200893      1297     1158       139     3084   2514
## 1.12     3501    200893      1297     1158       139     3084   2514
##      c_movelocal c_movecounty c_movestate c_moveabroad c_commute c_car
## 1            202          120         219           29      1542   864
## 1.8          202          120         219           29      1542   864
## 1.9          202          120         219           29      1542   864
## 1.10         202          120         219           29      1542   864
## 1.11         202          120         219           29      1542   864
## 1.12         202          120         219           29      1542   864
##      c_carpool c_transit c_bike c_walk year   p_white    p_black   p_asian
## 1          165       271      0     10 2018 0.6597111 0.04109149 0.1900482
## 1.8        165       271      0     10 2018 0.6597111 0.04109149 0.1900482
## 1.9        165       271      0     10 2018 0.6597111 0.04109149 0.1900482
## 1.10       165       271      0     10 2018 0.6597111 0.04109149 0.1900482
## 1.11       165       271      0     10 2018 0.6597111 0.04109149 0.1900482
## 1.12       165       271      0     10 2018 0.6597111 0.04109149 0.1900482
##        p_latinx  p_owners p_renters    p_stay p_movelocal p_movecounty
## 1    0.03338684 0.8928296 0.1071704 0.8151751  0.06549935   0.03891051
## 1.8  0.03338684 0.8928296 0.1071704 0.8151751  0.06549935   0.03891051
## 1.9  0.03338684 0.8928296 0.1071704 0.8151751  0.06549935   0.03891051
## 1.10 0.03338684 0.8928296 0.1071704 0.8151751  0.06549935   0.03891051
## 1.11 0.03338684 0.8928296 0.1071704 0.8151751  0.06549935   0.03891051
## 1.12 0.03338684 0.8928296 0.1071704 0.8151751  0.06549935   0.03891051
##      p_movestate p_moveabroad     p_car p_carpool p_transit p_bike      p_walk
## 1     0.07101167  0.009403372 0.5603113 0.1070039 0.1757458      0 0.006485084
## 1.8   0.07101167  0.009403372 0.5603113 0.1070039 0.1757458      0 0.006485084
## 1.9   0.07101167  0.009403372 0.5603113 0.1070039 0.1757458      0 0.006485084
## 1.10  0.07101167  0.009403372 0.5603113 0.1070039 0.1757458      0 0.006485084
## 1.11  0.07101167  0.009403372 0.5603113 0.1070039 0.1757458      0 0.006485084
## 1.12  0.07101167  0.009403372 0.5603113 0.1070039 0.1757458      0 0.006485084
##      mean_API school_count                       geometry
## 1         864            1 MULTIPOLYGON (((-122.2469 3...
## 1.8       864            1 MULTIPOLYGON (((-122.2469 3...
## 1.9       864            2 MULTIPOLYGON (((-122.2469 3...
## 1.10      864           NA MULTIPOLYGON (((-122.2469 3...
## 1.11      864           NA MULTIPOLYGON (((-122.2469 3...
## 1.12      864           NA MULTIPOLYGON (((-122.2469 3...
# map the output with plot
plot(tracts_acs_sf_ac2['school_count'])

Solution hidden here!

7.4 Recap

We discussed how we can combine datasets to enhance any geospatial data analyses you could do. Key concepts include:

  • Attribute joins
    • merge()
  • Spatial joins (order matters!)
    • st_join
  • Aggregation
    • aggregate.sf

 D-Lab @ University of California - Berkeley
 Team Geo